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Abstract. The topic of statistical inference for dynamical systems has 
been studied extensively across several fields. In this survey we focus on 
the problem of parameter estimation for nonlinear dynamical systems. 
Our objective is to place results across distinct disciplines in a common 
setting and highlight opportunities for further research. 

1. INTRODUCTION 

The problem of parameter estimation in dynamical systems appears in many 
areas of science and engineering. Often the form of the model can be derived 
from some knowledge about the process under investigation, but parameters of 
the model must be inferred from empirical observations in the form of time series 
data. As this problem has appeared in many different contexts, partial solutions 
to this problem have been proposed in a wide variety of disciplines, including 
nonlinear dynamics in physics, control theory in engineering, state space model- 
ing in statistics and econometrics, and ergodic theory and dynamical systems in 
mathematics. One purpose of this study is to present these various approaches in 
a common language, with the hope of unifying some ideas and pointing towards 
interesting avenues for further study. 

By a dynamical system we mean a stochastic process of the form (X n ,Y n ) n , 
where X n+ \ depends only on X n and possibly some noise, and Y n depends only 
on X n and possibly some noise. We think of X n as the true state of the system 
at time n and Y n as our observation of the system at time n. The case when 
no noise is present has been most often considered by mathematicians in the 
field of dynamical systems and ergodic theory. In this case, all uncertainty in 
the system comes from the uncertainty in the initial state of the system, and 
the ability to estimate any parameters in the system may depend strongly on 
properties of the observation function f(X n ) = Y n , although such questions have 
rarely been addressed rigorously. State space models, considered most often by 
statisticians, lie at the other end of the noise spectrum, where both X n+ \ and Y n 
depend on some noise. Hidden Markov models, which have received considerable 
attention, provide a broad class of examples of these systems. In this setting, 
the statistical question of consistency for methods of parameter estimation has 
been studied, and some general results are available. The other two possible 
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assumptions on the presence of noise (assuming only dynamical noise or assuming 
only observational noise) have received relatively little attention, especially from 
the statistical point of view. While many of the proposed methods of parameter 
estimation for dynamical systems with observational noise have been studied via 
numerical simulations or on particular data sets, very few of these methods have 
been studied on a theoretical level. In fact, in the observational noise setting, 
basic statistical questions, such as whether a proposed method is consistent, have 
been considered only rarely, if at all, despite the fact that such systems capture 
important features of many experimental settings. 

Consider, for example, the question of parameter inference for models of gene 
regulatory networks. The underlying model often favored by biologists consists of 
a system of ordinary differential equations, with each variable in the state vector 
representing the expression level of a particular gene in the network. For some 
networks of interest, a significant amount of work has produced biological under- 
standing regarding the qualitative interactions between the genes in the network, 
but the corresponding ODE models still contain several parameters necessary for 
quantifying these interactions. Experimentalists are able to conduct experiments 
in which the expression levels of the genes in the network are measured at reg- 
ularly spaced instances of time. The resulting data may be interpreted as time 
series data generated by a system of ODEs with noisy observations. The param- 
eter inference problem in this setting consists of inferring the parameters of the 
ODE model from the observed data, and to the best of our knowledge there are 
no general statistical inference schemes for this type of problem that have been 
shown to be consistent. 

Another example of interest is identifying the behavior of a dynamical system 
on a network. In a variety of applications one considers nodes in a communication 
network and measures the states of these nodes (or properties of the nodes) over 
time. In many settings, one would like to detect drastic changes in the nature 
of the dynamic behavior of the system. This problem is of vital importance to 
a variety of security applications on networks, and it can be formalized as the 
inference of large changes in the parameters of the network - a change point 
model for a dynamic network. 

The objective of this article is to survey methodology across a variety of fields 
for parameter inference in stochastic dynamical systems. We first state the various 
goals of inference in dynamical systems. Our focus will be parameter inference 
and we provide a natural decomposition of parameter inference into four possible 
settings defined by the structure of noise in the system. We then state what is 
known in terms of rigorous results for parameter inference in these four settings. 
Of these settings the case of deterministic dynamics with observational noise is 
the least developed in terms of sound statistical theory and will be our focus. We 
also mention several important open problems for parameter inference in these 
types of systems. 

There is an extremely large body of work stretching across many disciplines 
that relates to the topic of statistical properties of dynamical systems. Although 
we attempt to provide references when possible, we make no attempt to be ex- 
haustive, and we recognize that in fact many references have been omitted. On 
the other hand, we hope that the references cited in this article may serve as an 
appropriate starting point for further reading. 
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2. BASIC DEFINITIONS AND PRELIMINARIES 

The most general setting that we will consider may be described as follows. Let 
A, X, y, and M be Polish [^spaces (complete metric spaces with a countable dense 
set), where each one is equipped with its Borel cr-algebra. The space A denotes the 
parameter space, the underlying dynamical system evolves in the space X and the 
observations take values in y. We consider a stochastic process (X n ,Y n ) n , which 
satisfies the following dynamics: for some a in A and Xq distributed according to 
a Borel probability measure on X, 

(2.1) Y n = f a (X n ,e n ), 

(2.2) X n+ \ = T a (X n , S n+ i), 

where 5 n +i is the dynamical noise and e n is the observational noise. The maps 
T : Ax X x M — > X and f : Ax X x M — > y determine the evolution of the state 
space dynamics and the observation process, respectively. We refer to a sequence 



(X n ) n satisfying (2.2) as a trajectory and a sequence (Y n ) n satisfying (2.1) as a 



sequence of observations. 

Let us start with the following definitions. 

Definition 2.1. A stochastic process (X n ) n is stationary if for any k, n and 
Tii, . . . , rifc in N, the joint distribution of (X ni+n , . . . , X nk+n ) is equal to the joint 
distribution of (X ni , . . . , X nk ) . 

Definition 2.2. An Af-valued stationary stochastic process {X n ) n is said to 
be ergodic if for every £ > 1 and every pair of Borel sets A,B£ X , 

1 n 

lim -Y j F((X 1 ,...,X e )£A, (X k+1 ,...,X k+e )€B 
k=l 

/ (li,..,ii)ei)F((4...,l { )e8). 

Definition 2.3. A measurable dynamical system is a triple (X, J 7 , T), where 
(X, T) is a measurable space and T : X — > X is measurable. A topological 
dynamical system is a pair (X,T), where X is a topological space and T : X — ¥ X 
is a continuous map. In the study of topological dynamics, one often assumes that 
X is compact and metrizable. 

Definition 2.4. A measure-preserving system is a quadruple (XjJ 7 , T, //), 
where {X,F,[i) is a measure space, T : X — > X is measurable, and ^(T~ 1 (A) s j = 
fj-(A) for each A in T . In this case, we say that T preserves the measure [i and \x is 
an invariant measure for T. For the purpose of this article, we will always assume 
that any invariant measure fj, is a probability measure. Also, if X is Polish and 
T is the Borel cr-algebra, then we may refer to (X, T, /x) as a measure-preserving 
system. 

Definition 2.5. A measure-preserving system (X, J-, T, /i) is ergodic if T _1 (A) 
A implies n(A) £ {0, 1} for any A in T . We may say that T is ergodic for fj,, or 
we may say that fj, is ergodic for T. 



1 This is a classical assumption in dynamical systems. 
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With the definitions given above, there is a correspondence between station- 
ary stochastic processes and measure-preserving systems. Let us describe this 
correspondence as follows. Suppose (X n ) n is an Af-valued stationary stochastic 
sequence, where X is Polish. Let y = Y\ n X, equipped with the product a- 
algebra induced by the Borel cr-algebra on X. Define T : y — > y by the left shift: 
if y = (x n ) n , then (T(y)) n = x n+ \. Kolmogorov's consistency theorem gives that 
there is a unique probability measure [i on y with the same finite dimensional 
distributions as (X n ) n . In this case, the stationarity of (X n ) n corresponds exactly 
to the invariance of \i with respect to T . Moreover, if (X n ) n is ergodic, then \i is 
ergodic for T. 

In the other direction, given any measure-preserving system (X, T, fj,), we may 
define a stationary stochastic process as follows. For any Polish space y and 
measurable map / : X — > y, let X n {u) = f(T n (ui)). If (X,T,fi) is ergodic, then 
so is (X n ) n . 

Recall that an ^-valued stochastic process (X n ) n is a Markov chain if for 
every x in X, there exists a probability measure vr(x, •) on X such that for each 
measurable set A in X, it holds that 

P(x n+ i e A\Xi = x 1 ,...,X n =x n j = Tr(x n ,A). 



In the model (2.1 )- p^2[ ), if the dynamical noise process (S n ) n is assumed to be 
i.i.d., then both (X n ) n and (X n ,Y n ) are Markov chains. This fact is particularly 
relevant in Sections [5] and [6j where the process (8 n ) n is assumed to be non-zero. 
Even in this case the process (Y n ) n may exhibit long-range dependencies. Setting 



the dynamical noise to zero in model (2.1)-(2.2) can be thought of as as a very 



degenerate Markov chain, but it is not clear in this case how helpful the Markov 
perspective is, since even the process (X n ) n may exhibit long-range dependencies. 

2.1 Goals of statistical inference 

There are a variety of topics that can be considered part of "statistical inference 
in dynamical systems." In the interest of providing context for this survey, let us 
mention the following topics: 

1. parameter estimation, model identification or reconstruction; 

2. state estimation, filtering, smoothing, or denoising; 

3. feature estimation, where features often include invariant measures, dimen- 
sions, entropy, or Lyapunov exponents; 

4. prediction or forecasting; 

5. noise quantification, estimation, or detection. 

In this paper we focus almost exclusively on the problems of parameter in- 



ference, system identification or reconstruction. In the setting of (2.1)-(2.2), we 



pose the parameter estimation problem as follows. Suppose the family of dy- 



namical systems can be parametrized by T a , with parameter a € A, as in (2.2). 



Construct statistical procedures for estimating the parameter a, given observa- 



tions Y\,Yi,--- ,Y n from (2.1), and provide adequate theoretical support for the 
validity of the estimation procedure. 

Of course, the boundaries between the problems mentioned above are often 
quite blurred. For example, if one can accurately estimate the hidden states 
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(^fc)fc=o f rom the data (Yfe)^Z , then the problem of system identification of- 
ten becomes significantly easier. For this reason, parameter inference methods 
often simultaneously attempt some version of state estimation or denoising. 

2.2 Organization of the paper 



We organize this survey according to which of the two types of noise in (2.1)- 



(2.2) are present (i.e. non-zero). This organization is motivated by the observation 
that methods and results for parameter inference in dynamical systems tend to 
be specific to the type of noise assumed in the model. 

The remainder of the paper is organized as follows. In Section [3] we describe 
some results relevant to inference for dynamical systems in the absence of noise. 
Section [4] contains a variety of proposed methods dealing with the case of dynam- 
ical systems contaminated by observational noise only. Section [5] deals with the 
case of only dynamical noise, and Section [6] addresses the setting of state space 
models, that is systems with both dynamical and observational noise. Lastly, we 
highlight some possibly interesting open questions in Section [7J 

Ornstein and Weiss [91] have shown that in a certain sense it is impossible, in 
general, to tell the difference between observational and dynamical noise. In this 
sense, one might suggest that from the point of view of abstract ergodic theory, 
we should not make distinctions on the basis of the type of noise present. How- 
ever, we are interested in finer properties than those captured by the equivalence 
relations considered in [91] , and therefore the distinction between observational 
and dynamical noise might still be useful for our purposes. 

2.3 Related surveys and books 

There have been many other reviews of topics related to the topics in this 
survey. An incomplete list of such reviews is the following: [7] [9] [Jj)J S3 H3 [50], 
[66] 1114} 1120] . Furthermore, let us mention the following books or monographs 
related to the topics in this survey: [TJ EJ QH [29] [62] [Ml [9JJ [JTH]. The relevance 
of this survey is that we bring together approaches from many distinct fields and 
discuss them in a common statistical setting. In particular we discuss parameter 
estimation and inference for the full range of noise settings. This perspective is 
rare since the different noise settings often correspond to different research areas 
such as deterministic dynamics or state space methods based on hidden Markov 
models. We bring these various approaches together and place them in a common 
context. Inference in dynamical systems for a variety of contexts was discussed in 
Berliner (1992) [7], and our survey can be thought of as an updated and greatly 
expanded version of this work. 

3. NO NOISE 



If no noise is present in the model (2.1)-(2.2), then we have the following 
situation: 

(3.1) Y n = fa(X n ) 

(3.2) X n+ 

i — T a (X n ) 1 

where T : A x X — > X is & parametrized family of maps and f : A x X — ^ is a 
parametrized family of observation functions. For a fixed parameter value a, the 



model (3.1 )-(3.2 ) is one of the classical objects of study in dynamical systems and 
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ergodic theory (for general references on dynamical systems and ergodic theory, 

see HDIE31 123 [1531 ). 

3.1 Non-parametric system reconstruction from direct observations 

Here we consider non-parametric estimation of a map T from direct observation 
of a single trajectory. Although the methods discussed in this section do not 
directly involve parameter estimation, they are nonetheless relevant for parameter 
estimation, since any non-parametric method for estimation of a map immediately 
yields a method of parameter estimation if the map to be estimated comes from 
a parameterized family. 

Let us first consider a case when the system can be successfully reconstructed 
from observations. If X is a manifold, T is continuous, the trajectory (x n ) n is 
dense in X, and we observe the trajectory directly (i.e. the observations {y n )n 
satisfy x n = y n ), then T can be consistently estimated from (y n ) n using locally 
linear functions of the data. More precisely, let us state a result from [4J justifying 
this statement in the case X = [0, 1]. Let A be Lebesgue measure on [0, 1]. The 
map T : [0,1] — >• [0,1] is said to be an E{Ij, Oj}-map if there exists at most 
countably many disjoint open intervals Ij and real numbers ctj such that A(UJj) = 
1 and f'(x) = ctj for all x in Ij. 

Proposition 3.1 ([!]). Let T be an E{Ij,aj}-map. Suppose the observed 
trajectory (x n ) n is dense in [0,1]. Then there exists a sequence of estimates T n 
of T such that for almost every x in [0, 1], it holds that T n (x) = T(x) for all but 
finitely many n. In particular, T n converges to T pointwise almost everywhere, 
and \({x : T n ^ T(x)}) tends to zero. 

To get an idea about how to prove this proposition, notice that for any two 
consecutive points x n and x n+ \ in the trajectory, the pair (x n ,x n+ i) lies on the 
graph of T. Therefore one may estimate T by linearly interpolating between 
neighboring points on the graph of T. 

When the map T is not assumed to be continuous but only measurable, esti- 
mation of T from discrete observations of a single trajectory has been carried out 
by Adams and Nobel [1] . In this work, the map T is assumed to preserve a Borel 
probability measure \i on X, and the system (X,fi,T) is assumed to be ergodic. 
Their main result may be stated as follows. 

Theorem 3.2 ([!]). Let (io be a reference probability measure on X that is 
assumed to be "known." Also assume that there is a "known" constant M such 
that 1/M < dfj,/dfj,Q < M. Let Meas^) denote the space of measurable functions 
from X to X. Then there is a estimation scheme (T n ) n (whose definition uses M 
and hq), where T n : X n — > Meas^), such that for fiQ-a.e. initial condition xq, the 
map T n (xo, . . . , x n -i) converges to T in a weak topology (i.e. fi(T~ 1 (A)AT^ 1 (A)) 
tends to zero as n tends to infinity for each Borel set A). 

The estimation scheme (T n ) n that appears in [3] is constructed using an adap- 
tive histogram method, which we discuss below. This paper also shows that under 
the same hypotheses the conclusion of the theorem is false if one requires that 
n({x G X : T n (x) T(x)}) tends to zero as n tends to infinity. 
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Here we give an idea of the estimation scheme used in the proof of Theorem 3.2 



The histogram method described here is actually from [88], which is very similar 



in spirit to the method used in the proof of Theorem 3.2 Assume that X £ Mr, 
and we fix a refining sequence {itk)k of finite partitions of X with some additional 
properties (see [I] for details). Let iTk{x) denote the cell in tt^ containing x. Given 
the first n terms of the trajectory (xj)™~Q, let 



En— 1 j 
j=Q X j+l 1 {x j £n k (x)} 
<Pn,k[.X) = _ x , 

where I{xj£n k (x)} 1S the indicator function of the event that Xj is in -n^x), and if 
the cell vr^(x) contains no points Xj, then 4> nt k(x) = 0. Now consider the empirical 
loss of (p n y. 

/ i n ~ 2 \ V 2 

Vn 3=0 



The estimates T n of T are adaptively chosen from among the <f> n j- according to 
An t k (using no and M). This method has the advantage that it works in quite a 
general setting (the only assumptions involve ergodicity and the Radon-Nikodym 
derivative with respect to a reference measure). On the other hand, it relies on 
the ergodic theorem for convergence, and therefore it appears very unlikely that 
it would have any general speed of convergence. 

3.2 Non-parametric system reconstruction from general observations 

In this section we consider approaches to system reconstruction when the ob- 
servations (y n )n are not necessarily equal to the trajectory (x n ) n . There is a 
vast amount of literature on the technique of system reconstruction via delay 
coordinate embeddings. These system reconstructions may be thought of as non- 
parametric inference of dynamical systems. Delay coordinate embeddings are a 
well-studied inference procedure to reconstruct dynamical systems that satisfy 
certain conditions. In this section we define delay coordinate embeddings, men- 
tion some of the main uses of these techniques, and provide some representative 
theorems that provide conditions under which these methods work. 

The eventual goal of delay coordinate embedding techniques is typically feature 
estimation, which we summarize as follows. If the underlying map T and the 
observation function are both smooth, then under generic conditions, a delay 
coordinate embedding allows one to construct a smooth map T such that T is 
related to T by a smooth change of coordinates. Under this scenario, T and T 
will share many features, including entropy, Lyapunov exponents, and fractal 
dimensions of corresponding invariant measures. As these features are considered 
important in many physical settings, such delay coordinate reconstructions have 
been extensively studied. 

To be specific, we consider a smooth map T : X — > X of a manifold X, with a 
smooth observation function / : X — > R. The data are assumed to be generated 
as follows: there is a trajectory (x n ) n such that x n+ \ = T(x n ), and we observe 
the data (y n ) n such that y n = f(x n ). The original idea to use delay coordinate 
embeddings to construct a system equivalent to (X, T) from the observations is 
due to Ruelle, at least according to the influential paper 
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Definition 3.3. A delay coordinate mapping of X into M m is a mapping 
F : X -> R m such that 

F(x) = (f(x),f o T T (x), ...Jo T^ m ~ l \x)), 

for some natural number r. The mapping F is said to be an embedding if it is a 
diffeomorphism from X to its image F(X), that is if F is a smooth injection and 
has a smooth inverse. 

The well-known theorem of Takens |116| (often called the Takens Embedding 
Theorem) may be stated as follows. 

Theorem 3.4 ( |116j ). IfT, f, andr satisfy certain genericity conditions and 
m > 2dim(Af), then F is an embedding. 

Let X = F(X) and f = F o T o F~ l . The fact that F is an embedding means 
that the system (X,T) is related to the system (X,T) by a smooth change of 
coordinates (given by F). In particular, invariants of (X,T) that depend on the 
differential structure of T (such as Lyapunov exponents or fractal dimensions of 
attractors) are equal to those of the system (X,T). 

In particular, given the data (yk)kZo> we may build time series data (s^)? q "* 
for the system (X, T) as follows: for k = 0, . . . , n — 1 — 7~(m — 1), let 

Sk = (Vki Uk+r, • • • , Vk+T{m-1))- 

Then the new time series (sk)k may be used to estimate invariant features of 
(X, T), which will be the same as those features of (X, Q). 

Takens's theorem has been generalized in various directions, such as filtered 
delay embeddings (see |109j . for example) or delay embeddings for stochastic 
systems (see |113j ). but we do not attempt to record all such results. However, 
the following generalization, due to Sauer, Yorke, and Casdagli, bears mentioning. 

Theorem 3.5 (|109|). Let A be a compact subset of X with box-counting 
dimension d. Let m > 2d. Suppose T , f, t, and A satisfy certain genericity 
conditions. Then the delay coordinate map F given above is an injection on A 
and an immersion on each compact subset of any smooth manifold contained in 
A. 

The advantage of this theorem over the Takens theorem is that the relevant 
dimension d might be less than the ambient dimension of X, in which case the 
number of coordinates m required in the embedding space may be less than the 
number of coordinates required by Takens's theorem. 

In order to use the delay coordinate method given only the data (yfc)^Z 0J one 
must choose an appropriate dimension m and an appropriate lag r. A variety of 
statistical techniques have been proposed to estimate the dimension m and find 
a suitable lag r (for example, see the book j62] or the collection [82]), but further 
pursuit of these topics lies outside the scope of this survey. 
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3.3 Results from ergodic theory 

In this section, we state some results from ergodic theory that are relevant for 
parameter inference. 

One of the most general results in this area is due to Ornstein and Weiss [93J . 
In this work, the authors consider the problem of estimation of stationary ergodic 



processes . (Note that in the setting of ( 3 . 1 )- ( 3 . 2 ) , if Xq is distributed according to 



an ergodic invariant measure for T a , then the observation process (Y n ) n satisfies 
exactly these conditions.) To make this problem precise, they consider the d met- 
ric on the space of such processes. Their main results may be stated as follows. 
First, they construct a procedure which, given a realization (^fc)^Zo of a pro- 
cess (Xk)k constructs a process Z n = Then they show that the sequence 
of processes (Z n ) n converges to (X^)}- in the d metric if and only if (X^^ is 
Bernoulli. Thus, they have shown that there is a consistent estimation procedure 
for the class of Bernoulli processes. Furthermore, they show that no estimation 
procedure can be consistent for the class of all stationary ergodic processes. 

In another direction, Ornstein and Weiss [92] show that entropy is the only 
finitely observable invariant in the following sense. Let J be a function from the 
class of finite- valued stationary ergodic processes to a complete separable metric 
space such that J is constant on isomorphism classes. The main result of [92] 
states that if J is finitely observable, then it must be a continuous function of the 
entropy. This result shows that there are strong restrictions on the possibilities 
for inference of isomorphism invariants. 

Gutman and Hochman |41] extend the results in |92| in several ways. They give 
several rich families of classes C of stationary ergodic processes such that if J is 
a finitely observable invariant on C, then J is constant. They also show that for 
every finitely observable invariant J on the class of irrational circle rotations, J is 
constant on the processes arising from a full measure set of angles. In particular, 
there is no finitely observable invariant for irrational rotations which is complete. 

There is a large body of work, often categorized as smooth ergodic theory, that 
seeks to understand the statistical properties of smooth (or piecewise smooth) 
dynamical systems. The typical setting is that one has a compact Riemannian 
manifold M and a smooth self-map / : M — > M. The manifold typically has 
a distinguished probability measure A, which one may think of as volume mea- 
sure on the manifold. The goal is to understand the asymptotic behavior of the 
trajectory {f n (x)} n for A-a.e. x. For a wide class of such systems [126] . often 
called (non-uniformly) hyperbolic systems, there is an invariant measure fi on 
M such that for x in a set of positive A-measure, the trajectories in x equidis- 
tribute with respect to li. In such cases, the measure \i is said to be a physical 
measure. Often the measure jx has some additional properties (it has no zero Lya- 
punov exponents and absolutely continuous conditional measures with respect to 
A on unstable manifolds), and in this case ji may be called an SRB (Sinai- Ruelle- 
Bowen) measure [127] . The ergodic theory of SRB measures is fairly well-studied, 
and many of their statistical properties have been analyzed. 

A related topic that has seen a great deal of attention recently is concentration 
inequalities for dynamical systems [THl E3 EH ESI ED]- These inequalities are 
used to study the fluctuations of observables for dynamical systems and have 
been shown to hold for sufficiently regular observables and a wide class of non- 
uniformly hyperbolic dynamical systems. Using these inequalities, it is possible 
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to perform statistical estimation of various features of the dynamical system. See 
the survey [16] for more details and precise statements. 

3.4 Parameter inference via synchronization and control 

Synchronization-based approaches to parameter estimation have appeared quite 
often in the physics and control systems literature [31 [TJH E3 C02, 128J. In situa- 
tions when these methods are used, it is common that no particular noise model 
is assumed. Indeed synchronization-based approaches are typically described as 
parameter inference methods in the noiseless setting, although they may be ap- 
plied in other settings. The main idea of synchronization-based methods is to 
insert a "control" term in the defining equations of the system that allows one to 
incorporate the data. The parameter estimation may then be framed as a large 
optimization procedure in which one tries to find trajectories of the system which 
are close to the data. 

The topic of parameter estimation in a noiseless setting is discussed directly in 
the work of Abarbanel, Creveling, Farsian, and Kostuk [2], and we review their 
approach in this section. The main issue in this context is that one only has 
access to the observations (Y n ) n , which might "hide" some information about 
the underlying system. The approach taken in [2] involves synchronization of 
the observations and the output of a model over the relevant time window. This 
approach may be summarized as follows. 



Suppose that X is in Mr and the system (3.1)-(3.2) has the following form: 



where X n ^ denotes the i-th coordinate of X n . The synchronization approach 
taken in [2] is to add a "control" term of the form k(Y n — X n< i) to first coordinate 
of the model as follows: 

X n +i,\ = T a> i(X n ) + k(Y n — X rh i) 
X n +i,i = T a: i(X n ), i > 1. 

For k > large enough, the data Y n and the first coordinate X n> i of the model 
trajectory will "synchronize." With a fixed k, the authors propose to estimate 
the parameter a and the initial state Xq by minimizing the following function: 

n-1 

C(a,X ) = Y,( Y n-Xn,l)\ 
3=0 

where the trajectory X n is computed starting at Xq = Xo. The purpose of adding 
the control term is to regularize the function C so that its minimum may be 
found efficiently. Of course, the trajectory X n associated with this minimum is 
not a true trajectory of the original system. Therefore the authors propose a 
synchronization method that allows the parameter k to depend on time. In other 
words, they propose to minimize the cost function 

n-1 

C{a,X ) = Y,{Y j -X 3A ? + kl 

3=0 
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subject to the constraints 

Xj+i,! = T^iXj) + kjiYj - X jA ) 
Xj+i,i = T at i(Xj), i > 1. 

Although this method has been observed to work sufficiently well in practice [2] , 
we remark that to the best of our knowledge there are no theoretical guarantees 
regarding the consistency or performance of this method. 

4. OBSERVATIONAL NOISE ONLY 



If only observational noise is present in the model (2.1)-(2.2), then the system 



(2.1)-rt2.2ft reduces to the following situation: 



(4.1) Y n = f a (X n ,e n ) 

(4.2) X n+l = T a (X n ), 

where (e n ) n is a noise process, T : A x X X is & parametrized family of 
maps, and f: AxXxM^-y is a parametrized family of noisy observation 
functions. Multiple authors explicitly argue for consideration of the observational 
noise model. For example, Judd [51] states that "the reality is that many physical 
systems are indistinguishable from deterministic systems, there is no apparent 
small dynamic noise, and what is often attributed as such is in fact model error." 
Furthermore, Lalley and Nobel [7T] remark that "estimation in the observational 
noise model has not been broadly addressed by statisticians, though the model 
captures important features of many experimental situations." 

A distinguishing feature of the observational noise model is that the process 
(X n ) n is deterministic, and therefore in general it exhibits a long-range depen- 
dence structure. Furthermore, this long-range dependence is still present beneath 
the noise in the observation process (Y n ) n . Such dependencies imply that tradi- 
tional statistical estimation techniques do not apply and may not work. As Lalley 
and Nobel state in [71], "though some features of denoising can be found in more 
traditional statistical problems such as errors in variables regression, deconvo- 
lution, and measurement error modeling (c.f. [12J ) , other features distinguish it 
from these problems and require new methods of analysis." In particular, they 
cite the facts that the covariates X n are deterministically related (as opposed to 
i.i.d. or mixing), the noise is often bounded (as opposed to Gaussian), and the 
noise distribution itself is often unknown. 



Example 4.1. Let X = [0,1], and let T a : X ->■ X be given by T a (x) = 
ax(l — x), with o in A = [0, 4]. This family of maps, known as the logistic family, 
has been extensively studied in a variety of settings. For a 6 [0,1], it is known 
that for all x in [0,1], the iterates T"(x) tend to as n tends to infinity. We 
say that a parameter value a has an attracting periodic orbit {po, . . . ,pn-i} if 
T a (pi) = pi + i (with indices interpreted modulo N) and \(T^)'(pi)\ < 1. For such 
parameter values, the iterates T^(xq) of Lebesgue almost every initial point xq 
will tend to the periodic orbit {po, . . . ,pn-i} as n tends to infinity. It is known 
|40| [74] that the set of parameter values that have an attracting periodic orbit 
is open and dense in [0,4]. On there other hand, there are parameter values 
that give rise to very different asymptotic dynamics. In particular, we say that a 
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parameter value a has an absolutely continuous invariant measure (acim) [i a if \x a 
is absolutely continuous with respect to Lebesgue and \i a is an invariant measure 
for T a . In such cases, it can be shown that the iterates T™(xo) of Lebesgue almost 
every initial point xq equidistribute with respect to fi a . Intuitively, the presence 
of /i a produces seemingly stochastic behavior, which is often referred to as chaos. 
Jakobson showed in [48j that the set of parameter values that have an acim has 
positive measure in [0,4], and Lyubich eventually showed in [75] that Lebesgue 
almost every parameter in [0, 4] either has an attracting periodic orbit or an acim. 

In most of the papers cited in this section, this family of maps is taken as 
a standard testing ground for parameter estimation methods. Generally, it is 
assumed that the observational noise is additive (i.e. f a (x, e) = x + a(a)e). 

4.1 Noise reduction 

One basic approach to parameter estimation in the observational noise case is 
to reduce the noise and then apply parameter estimation methods. If the noise 
can be uniformly and sufficiently reduced, then these approaches will be approx- 
imately as successful as the estimation method applied to the noiseless case. For 
example, the positive results in [B9~t 170"! 171] might be combined with a parameter 
estimation method in order to produce consistent estimates. Among the results 
contained in these works, the main positive result of [7T] is the most general, and 
we state it as follows. 

A homeomorphism F of a compact metric space (A, d) is said to be expansive 
with separation threshold A if for every x ^ y in A, there exists n in Z such that 
d(F n (x) , F n (y)) > A. In the work |71j . the authors consider an initial condition 
x and let Xi = F l (x). Also, they define a particular denoising algorithm which, 
given noisy additive noisy observations (yi)^IJj f, produces estimates Xi^ n of the 
true states X{. In this context, the main positive result may be stated as the 
following theorem. 

Theorem 4.2 (|7T]). Let F : A — > A be an expansive homeomorphism with 
separation threshold A > 0. Suppose that the noise process (e n ) n satisfies \e n \ < 
A/5 for every n. If k = k(n) —> oo and k/ login) — > as n tends to infinity, then 

j n— k 

TTT ^ \%i,n ~ Xi\ -> 0, OSn-f OO 

i=k 

with probability 1 for every initial point x in A (with respect to any F invariant 
Borel probability measure). 

By allowing a slight modification to their estimation scheme, the authors also 
show that under the same hypotheses 

max |xj ; „ — Xi\ — > 0, as n — > oo 

log(n) <i<n— log(ra) 

with probability 1 for almost every initial point x in A. 

Of course, the task of removing the noise might itself be difficult or in some 
cases even impossible, as witnessed by the negative results in [691 EB EI] and the 
related results in [52], [531 E3 ES] • Here we state the main negative result in [71]. A 
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pair of points x and x' is said to be strongly homoclinic for the homeomorphism 
F if 

Y J d{F k {x),F k {x')) < oo. 

fcez 

Theorem 4.3 ([71]). Suppose the stationary distribution for the noise process 
(e n ) n is unbounded (or has sufficiently large support). If x and x' are strongly 
homoclinic, then for every measurable function cf) : JT X — > M. d , 



E 



\4>({yn)n) ~X\- \4>((y' n )n) ~ x'\ 



> 0. 



In other words, even with access to the entire observation sequence, any state 
estimation or denoising scheme will fail with positive probability. 

Lalley and Nobel also point out that despite the fact that in such cases the 
problem of asymptotic denoising is impossible, it might still be possible to obtain 
consistent parameter estimates. In fact, they state that such examples might 



provide an interesting avenue for further study (see Question 7.1). 

In addition to the works mentioned so far in this section, the following works 
discuss the problem of denoising or smoothing data in the presence of only ob- 
servational noise: [221 E3 EHl EH EH E7J QUE] • 

4.2 Introduction to likelihoods and related methods 

We begin with the work of Berliner (£[[7], sets the stage for most of the work 
that has followed. In these works, the author is mostly concerned with the obser- 



vational noise setting (4.1 )- ( |4T2[ ) . The likelihood function is given by 

L(x ,a) = p(yQ~ 1 \x ,a), 

where p(Vq~ 1 \xo, a) denotes the likelihood of observing y$~ l given the parameter 
choice a and the true initial condition xq {i.e. p(-\xo,a) is the probability density 
for the observation process conditional on xq and a). The maximum likelihood 
(ML) method for estimating the parameter a amounts to defining the following 
maximum likelihood estimator (MLE): 

(4.3) a n = argmaxmaxL(xo, a). 

a x o 

It will be useful to find an explicit form for the likelihood function in the case 
that (I) the observational noise sequence (e n ) n is assumed to be i.i.d. normal with 
zero mean and unit variance, and (II) the observation function f a takes the form 
f a (x, e) = x + a(a)e. The function a(a) allows one to set the variance of the noise 
according to the parameter a. In this case, we have 

L(x ,a) = (a(a)v^)~ n exp - (x )) 2 / (2a 2 (a)) J 

V k=0 J 
and the corresponding log-likelihood function is given by 

n-l 



(4.4) logL(:c ,a) = -nlog(a(a)V2^) - ]T(y fc - (x )) 2 / (2a 2 (a)) 



k=0 
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A significant portion of the work on parameter estimation following Berliner has 
involved optimization of this log likelihood function, even when the noise is not 
necessarily Gaussian and thus its interpretation as a log likelihood function is no 
longer valid. 

As discussed in |98| . no existing statistical results apply to the ML method in 
this setting. With the above notation, the main difficulty in the current setting 
is that T a fe is a non-stationary function of k. Standard statistical results on the 
performance of the ML method apply when the likelihood function has no such 
dependence on k (or is periodic with respect to k), but these results do not apply 
o priori in the current setting. 

The Bayesian approach assumes a prior distribution (density) for xq and a, 
written as ir(xo,a). Given the data 2/q -1 , the posterior distribution is then 

ir(x a\y n ~ l ) = ^o" 1 ^ '") 71 "^ '") 

"" J p(yQ~ 1 \x,a)Tr(x,a) dxda 

In these basic definitions, Berliner considers three main methods of parame- 
ter estimation: maximum likelihood estimation, minimization of a cost function 
(which is often chosen to be the negative of the log likelihood function) and 



Bayesian estimation. One of Berliner's main points is that when the system (4.1 )- 



(4.2) is chaotic, the likelihood function will also typically be chaotic, in the sense 
that it will be extremely jagged. The rough nature of these likelihood functions 
makes all three of the above methods of statistical estimation computationally 
very expensive, and much of the work following Berliner has been motivated 
by the need to mitigate this difficulty. Beyond these computational difficulties, 
we would like to emphasize that to our knowledge there are no general results 
concerning the consistency of any of these likelihood-based methods. 

4.3 Variations on likelihood based methods 

A common method of parameter estimation in practice is to minimize some 
cost function C with respect to the parameters. Given the observations (yfc)fc=o> 
such methods employ the following estimators: 

a n = argmin a minC (xo, a, (2/fc)fc=n) > 

where C(xo, a, {yk^Zo) somehow measures the discrepancy of the observations 
and the system trajectory having parameter a and initial state xo- 

As we mentioned in the previous section, the most basic cost function is the 
least squares cost function 

n-l 

(4-5) C LS {x ,a, M"o) = " T " (*o)) 2 - 

fe=0 

Perhaps due to the sensitive dependence of C LS on xq and the additional 
computational expense incurred by minimizing C LS over xq, several authors con- 
sidered minimization of a one-step least squares cost function, given by 

n-2 

(4-6) C OSLS (a, {y k ) n k zl) = J> fc+1 - T a (y k )) 2 , 

k=0 
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Fig 1. Least Squares cost function for Xo in logistic family as a function of x £ [0,1] given 
n = 20 observations, true initial value xq = .4 and true parameter a — 4. 




Fig 2. Least Squares cost function for parameter a in logistic family as a function of a £ [0, 4] 
given n = 20 observations, true initial value xq — .4 and true parameter a — 4. 



which does not depend on any initial condition xq. This cost function may appear 
to be the familiar least squares function from regression analysis, but as Kostelich 
|67j recognized, it suffers from the problem of errors in variables (c.f. \13\ [38]). 
The problem of errors in variables is that the errors are not independent as is 
assumed by the cost function. Viewing C OSLS from the perspective of traditional 
regression, we see that y^ appears to play the role of the independent variable and 
Vk+i plays the role of the dependent variable, but both y\. and yu+i contain noise 



according to the model (4.1 )-(4.2). It is well-known that the problem of errors in 



variables can lead to asymptotically biased results, and therefore we should not 
expect minimization of C OSLS to give consistent estimates of the parameter a. 

In response to the errors in variables problem, Jaeger and Kantz [471 IUT] pro- 
pose a "solution" of the problem, which amounts to minimizing the following cost 
function that has since gone by the name "total least squares" cost function: 



n-l 



(4.7) 



C TLS {a,{yk)l =Q ) 



Vmin|(y fc ,y fc+ i) 
k=0 



(vMy))f 



Note that this approach essentially ignores the dynamics altogether, and instead 
focuses on minimizing the sum of orthogonal distances between the graph of T a 
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and the points (y k , 2/fe+i) in X x X. In order to include some aspect of the dynam- 
ics, they further modify their cost function to find local shadowing trajectories 
by considering cost functions of the form 



n— s— 1 

(4.8) C MTLS {a)= V mm\(y k ,...,y k+s ) - (y, . . . ,T°(y))\. 

Z J y 

Here s is a parameter of the method; it is the number of steps over which one 
considers the local shadowing trajectories. If one asks for global shadowing tra- 
jectories, corresponding to s = n — 2, then this modified total least squares cost 
function is equivalent to the original least squares cost function C LS . 

McSharry and Smith [81] consider the one step cost function C OSLS given by 



(4.6). They prove that in the case of the logistic map with a specific parameter 
value, the minimization of this cost function produces biased estimates, even 
with infinitely many observations. Their proposed solution involves minimizing 
the cost function given by 

(4.9) C MS (a) = -E lQ g(/ «p(~^) 

where d\{x) = \{y k , yk+i) — (x, T a (x))| 2 , e is the variance of the noise process (e n ) n , 
and jj, a is a particular invariant measure for the map T a . They argue that the min- 
imum of C MS provides more reliable parameter estimates due to its inclusion of 
information regarding the invariant measure [i a . It is perhaps a shortcoming of 
this method that one must know the variance of the noise process and the in- 
variant measure \x a in order to calculate C MS (a). In practice, the authors suggest 
approximating the integral with respect to fj, a by a sum over a long piece of tra- 
jectory simulated from the model in the hopes that this approximation will be 
close to the integral by the ergodic theorem. The authors provide numerical evi- 
dence that C MS provides better parameter estimates than either C OSLS or C TLS , 
although again no theoretical results are available to justify this comparison. 

Meyer and Christensen [83] , following up on the work of McSharry and Smith 
[81], propose to model the system using a combined noise state-space model of 



the form (2.1)-(2.2), and proceed via an MCMC algorithm for performing the 
inference. In particular, they take a Bayesian approach, modeling both the true 
states X n and the parameters a as unknown variables. They assume that the 
process (X n ) n forms a Markov chain (by adding dynamical noise to the model). 
Then they compute posterior probabilities of the unobserved variables using the 
Gibbs sampler and the Metropolis-Hastings algorithm. 

In his paper [51] ; titled "Chaotic-time-series reconstruction by the Bayesian 
paradigm: Right results by wrong methods," Judd discusses the Bayesian ap- 
proach of Meyer and Christensen [83] . and argues that their approach might work, 
but for "accidental" reasons. In particular, he objects to the fact that Meyer and 
Christensen have replaced the deterministic model by a stochastic model, and he 
claims to formulate the "correct" Bayesian approach for the deterministic model, 
which he acknowledges is essentially that given by Berliner [6] (presented in Sec- 
tion 4.2). He argues that their model only appears to give correct results because 
it happens to find shadowing trajectories of the true system. (We remark that an 
e shadowing trajectory for a sequence (z n ) n of states in X is a true orbit (x n ) n of 
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the system such that d(x n , z n ) < e for all n.) In the end, he argues for methods 
based on a direct search for shadowing trajectories, which purportedly require 
significantly less computational effort than the Bayesian approach of Meyer and 
Christensen. Such methods are often referred to as gradient descent methods; for 
further reading about these methods, see |544 1104] and references therein. 

Pisarenko and Sornette [98J consider the parameter estimation methods dis- 
cussed above, as well as the method of moments. They point out that the method 
of moments seems to be the only method so far considered whose asymptotic con- 
sistency has been rigorously proved on even a single example. They also provide a 
careful analysis of the work of McSharry and Smith. This analysis sheds light on 
some errors, both quantitative and qualitative, in the work [ST]. In order to pro- 
vide a useful estimation procedure, they propose a "pure" likelihood method, in 
which they cut the time-series data into n\ sub-intervals of length ni and perform 
ML estimation on each interval independently. In this method, the resulting n\ 
parameter estimates are averaged to produce a single estimate. The motivation 
behind their method seems to be the following. A theoretically true/pure ML 
method involves treating xq as a parameter to be estimated (as in Section 4.2), 
but the chaotic nature of the system means that the system forgets its initial con- 
dition exponentially quickly, which implies that it cannot be reliably estimated. 
Hence, they arrive at the method of chopping the time series into smaller pieces 
(which hopefully still contain useful information about the initial condition of 
each piece) and using the pure ML method on each piece. The statistician inter- 
ested in mathematical rigor is likely to find this work rewarding to read. A word 
of caution: they conclude their article by stating that "the situation is rather 
hopeless for the establishment of a meaningful statistical theory of estimation 
using the continuous theory of classical statistics to such discontinuous objects 
as the invariant measures of chaotic dynamical systems." 

Smirnov et al [112] note that the piecewise ML method of Pisarenko and Sor- 
nette [98] suffers from significant bias and potentially large variance, since it relies 
on chopping the data into many small subsets. In the case of one-dimensional 
maps, the authors propose a method based on backwards iteration of the map. 
They interpret their method as also relying on a ML principle, and they claim 
(with numerical support but no proof) that their method is asymptotically con- 
sistent with variance typically decreasing like n -2 , where n is the length of the 
observed time series. 

The work of Horbelt and Timmer [33] seeks to quantify the rate of convergence 
of parameter estimates to the true parameter value in the observational noise case 
as the number of observations grows. In the introduction, the authors claim that 
the MLE in this setting is unbiased and efficient, for which they refer the reader 
to an earlier version of the book Theory of point estimation by Lehmann and 
Casella [72], although it seems clear that this statement is mistaken, since in 
some cases it can be shown to be asymptotically biased. Nonetheless, the authors 
find numerical evidence for various scaling laws of the variance of the MLE. 

The work of Nakamura et al proposes yet another parameter estimation method 
in the observational noise setting [ST]. Here the authors suggest an iterative 
method that alternates between estimating the system states and the system 
parameters. In each of the optimization steps, they use somewhat standard tech- 
niques. To estimate the parameters, they minimize the cost function C OSLS in 
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(4.6) with respect to the parameter a. To estimate the states, they state that one 



could use any filtering method, such as the extended Kalman filter |12H I122j or 
gradient descent noise reduction (also known as gradient descent state estima- 
tion) |55} [66l 1104] . The novelty of their approach lies in the fact that they iterate 
between these two estimation steps. 

4.4 Method of moments 

Here we mention a method of parameter estimation that has been shown to 



be consistent at least for the logistic family, discussed in Example 4.1 For the 
observational noise model, this method, discussed in [98] , appears to be the only 
method that has been proved to be consistent for at least one non-trivial example. 

We consider the model fl^-pj]), where X = [-1, 1], A = [0, 2], and T a (x) = 
1 — ax 2 , which is change of coordinates of the family in Example 4.1 Assume 
that the underlying trajectory process {X n ) n is ergodic, which is the case if one 
assumes that Xq is drawn from an ergodic invariant measure \x a for the map T a . 
Alternatively, one may assume that a is chosen such that T a has an acim fi a (as 



discussed in Example 4.1) and Xq is drawn from Lebesgue measure. Also assume 
that the observational noise is additive (i.e. Y n = X n + e n ) and (e n ) n is i.i.d. 



Gaussian with mean and variance e 2 . For a sequence 



\ ELo z k and for an y / 

ergodic theorem 



(zk)kZo, let A 



I, let E„ a (/) = ff(x)dfi a (x). 



Zk) = 

Then by the 



(4.10) 
(4.11) 
(4.12) 
(4.13) 



lim A n {Y k ) =E„ a (x) 

n— »oo 

lim A n (Y k 2 )=E, a (x 2 ) 

n— >oo 

lim A n (Y£) = E Ma (x 3 ) + 3e 2 E Ma (x) 

n— >oo 

lim A n (Y k Y k+1 ) = E Mo (x) - aE Ma (x 3 ). 



Also, averaging the equation x n+ \ = 1 — ax 2 , we obtain that 



(4.14) 



E Ma (x) = l-aE Ma 



x 



Combining Equations (4.10)-( 4.14[ ), we arrive at the following estimates for the 



unknown parameters a, K fla (x), E Ala (x 2 ), E^ a (x 3 ), and e: 



a n = 



A n {Y k Y k+1 ) + 2A n (Y k ) + 3(A n (Y k )) 2 
2,A n (Y k )(A n (Y k )f - A n (Y k 3 ) 



E Ma (x) n = A n (Y k ) 



Eu a (x 3 ) 



e n = 



— A n (Y k ) — e n 

= ^(A n (Y k )-A n {Y k Y k+l )) 

A n (Y k 3 )- E, a (x 3 )n 



3A n (Y k 



These estimates are consistent by the ergodic theorem, but they might converge 
quite slowly, as there is no general rate of convergence in the ergodic theorem. 
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5. DYNAMICAL NOISE ONLY 



If only dynamical noise is present in the model (2.1)-(2.2), then the system 



(2.1)-rt2.2fo reduces to the following situation: 



(5.1) Y n = f a (X n ) 

(5.2) X n+ \ = T a (X n ,5 n ), 

where (5 n ) n is a noise process, T : A x X x M — > X is a parametrized family of 
noisy maps, and f: AxXxM^y is a parametrized family of observation 
functions. The dynamical noise model has been studied in the dynamical systems 
literature under the name "random dynamical systems" (see [65] and references 
therein). The process (X n ) n forms a discrete-time Markov chain on the continuous 
state space X (see the book of Meyn and Tweedie [84J and references therein) . In 
this case, some of the estimation methods from the statistical literature on time 
series and state space models may apply. 

Without using this Markov structure, Adams and Nobel have studied the non- 
parametric reconstruction of such systems from direct observations (i.e. Y n = X n ) 
|88| 189] . In particular, they used adaptive histogram methods to show results 
similar to those regarding non-parametric reconstructions of systems with no 



noise, as in Section 3.1. These methods do not work in the observational noise 



case precisely because in that setting they suffer from the problem of errors in 



variables, as discussed in Section 4.3 



A common setting for random dynamical systems is to assume that there is 
a map T : X — > X, where X is a compact manifold and T is smooth, with a 
"natural" invariant probability measure \x. In common examples, T might be a 
(non-uniformly) hyperbolic map and [i might have the property that almost every 
initial condition with respect a volume measure on the manifold equidistributes 
with respect to fx. In such cases, one typically adds dynamical noise as follows. 
Let e > 0. For each x in X, let P e (x, •) be the uniform measure on the ball of 
radius e about the point T(x). Then the Markov chain corresponding to this 
random dynamical system is determined by viewing P e as the transition kernel 
for the chain. Under some conditions, the chain corresponding to P € will have 
a unique stationary distribution, fi e . A well-known result (see [65]) states that 
under certain conditions, the measure fi e converges to fi weakly as e tends to 0. To 
the best of our knowledge, no theoretical work on parameter estimation has been 
conducted for this particular setting, perhaps making it an area ripe for progress. 
On the other hand, this setting may be viewed as a particularly degenerate version 
of the general state-space setting, in which there is no observational noise, and 
therefore all methods described in Section [6] may also be applied here. 

6. GENERAL STATE SPACE MODELS 



In this section we consider the full system (2.1)-(2.2), where both dynamical 
noise and observational noise are present. Specific versions of such models have 
long been considered in the statistics literature, where they are known as state 
space models |29J . The literature on state space models in both applied and theo- 
retical statistics is extensive and |42t 199] are two excellent texts covering applied 
modeling on this topic. The models can be summarized as the study of hidden 
Markov models (HMMs) in general state-spaces. (For an article discussing the 
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connections between ergodic theory and finite state HMMs, see [9].) Theoretical 
understanding of general HMMs has been a challenge and rigorous statements on 
consistency in parameter estimation have only appeared recently [27J (see Section 



6.2 ). Most of the work in this area has been devoted to the problem of state esti- 
mation or filtering, and even at a computational level the problem of parameter 
estimation is still largely unsolved. In this section we survey some of the most 
studied approaches to filtering and discuss parameter estimation where there are 
results. 

6.1 Kalman filter and some generalizations 

The simplest such models assume that the dynamics are linear and the noise 
is additive Gaussian: 

X n +i = AX n + B5 n+ \ 
Y n = CX n + De n , 

where here A, B, C, and D are all matrices of the appropriate dimension and 
(5 n ) n and (e) n are independent i.i.d. Gaussian processes. In this case, the optimal 
solution to the state estimation or denoising problem is given by the well-known 
Kalman filter [61\ 159) . Generalizations of the ideas behind Kalman filtering to 
non-parametric models have been an extensive area of research in Bayesian and 
frequentist inference [29], [37] ESI ES] • 

Conceptually, the simplest generalization of the Kalman filter to nonlinear 
models involves linearizing the models at each time point and then using the 
Kalman filter. This method is often called the extended Kalman filter (EKF) 
|49[[5]. While the Kalman filter is optimal in the sense that is the minimal- variance 
unbiased estimator, the general EKF is known to be biased. Furthermore, due to 
the linearization of the model, the propagation of the error covariance estimates 
may behave quite poorly if the non-linear terms in the model are significant. 

The unscented Kalman filter (UKF) |57L 5BJ provides a deterministic sampling 
scheme that has been observed to outperform the EKF. The basic idea behind the 
UKF is that instead of approximating the model by linearization, one ought to 
use the exact model but approximate the posterior distributions by Gaussian dis- 
tributions. The sampling scheme is designed to insure that the first two moments 
of the posterior distributions match the first two moments of the approximating 
distributions. It is believed that the UKF outperforms the EKF because it may 
be viewed as an unbiased second-order method, whereas the EKF is a biased first- 
order method. Of course, the UKF is believed to have shortcomings of its own; 
in particular, it assumes that the posterior distributions are Gaussian, which is 
certainly not the case in general. Also, the number of samples required for the 
UKF is at least the dimension of the state space, and in high-dimensional settings 
this fact makes the UKF computationally intractable. A wide variety of Monte 
Carlo (MC) methods have been proposed to overcome these issues. 

Another generalization of the Kalman filter is known as the ensemble Kalman 
filter (EnKF) (32j HU [33] . This method is a Monte Carlo method that is particu- 
larly popular in the weather prediction community. In fact, this method may be 



thought of as a type of particle filter (see Section 6.4.1). 



21 



6.2 MLE for HMMs 

If one is willing to consider point estimates of unknown parameters in a setting 
where the likelihood function is known, then one can consider the maximum 
likelihood method (MLE) for parameter estimation. Let us now state the main 
result of the paper |27j . which gives sufficient conditions for the consistency of 
MLE in this context. Let (X k , Y k )^ =l be a hidden Markov model (HMM) of the 



form (2.1)-(2.2). Let a* denote a fixed parameter value in A. Assume that the 
HMM with parameter a* has a unique stationary distribution, and let P a * be 
the corresponding stationary HMM. Denote by p u (yQ,a) the likelihood of the 
observations Yq with initial distribution Xq ~ v and parameter a. Consistency 
of the maximum likelihood estimator (MLE) may now be stated in the following 
form: if a n = argmax a p v (yQ , a), then a n converges P a *-a.s. to a* as n tends to 
infinity. The main result of |27j gives some general conditions under which the 
MLE is consistent in this sense. A precise statement of these general conditions 
is beyond the scope of this survey. 

6.3 Bayesian inference 

Recall the Bayesian formulation of state space estimation or filtering. Here 



one assumes that the model (2.1)-(2.2) gives rise to probability densities ii{x$), 
p(x\x'), and q(y\x), which define the initial distribution, transition kernel, and 
marginal distribution of the observation process, respectively. The densities are 
with respect to some fixed reference measures denoted dx and dy. In this frame- 
work, we are given access to finitely many observations y^ -1 , and we would like 
to estimate the true trajectory Xq -1 . Our assumptions define likelihood functions 

n-2 

pO^o" 1 ) = M^o) JJp(x fc+ i|x fc ), 

k=0 

and 

n-1 
fc=0 

Given the observations yn~ l , the posterior distribution for X^~ l is given by 



n-lN 



p(y n 

where 

^- 1 ,^ 1 )=^- 1 )p( yo "- 1 |xo" 1 ) 

.n— 1\ / „/'„n— 1 „.n—l\ jn—1 



There are a few instances when these distributions may be calculated ana- 
lytically, such as when the system is linear and the noise is Gaussian or when 
{X n } n is a finite state Markov chain. Outside of these cases, there is no ana- 
lytical method for calculating the posterior distribution, and therefore one seeks 
a numerical approximation for this distribution. With the significant advances 
in computational power in recent years, there has been a remarkable amount of 
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research devoted to finding efficient computational approaches to approximating 
such posterior distributions. In the remainder of this section we briefly discuss 
the linear Gaussian case and a few of its generalizations to nonlinear or non- 



Gaussian situations. Section 6.4 discusses some of the more recent computational 
approaches to filtering. 

An interesting work in the Bayesian context is |110| where the author studies 
posterior consistency for dependent data from an information theoretic point of 
view. The author establishes posterior consistency for misspecified models under 
the assumption of asymptotic equipartition property. For finite state space ergodic 
models, this is implied by the Shannon-McMillan-Breiman theorem. It could be 
interesting and useful to extend the ideas from |llOj to prove posterior consistency 
in parameter estimation for more general dynamical systems. 

6.4 Inference for dynamical systems via simulation based methods 



In the general non-linear, non-Gaussian state-space setting of (2.1)-(2.2), the 
posterior distributions for Xq _1 are not available in closed form, as they involve 
some integrals for which no analytical evaluation methods exist. In order to per- 
form inference in this setting, a great deal of effort has been devoted to developing 
sophisticated computational algorithms for for sampling from these posterior dis- 
tributions. One general idea is to use Monte Carlo (MC) methods to estimate the 
integrals of interest. It is worth emphasizing that there has been a huge amount 
of work in this direction, and we do not claim to provide a comprehensive survey 
of all the relevant results. For an introduction to MC methods, see the book [105J. 

6.4-1 MCMC methods, SMC and Particle Filters. If one cannot sample from 
the posterior distribution directly, then one often turns to Markov chain Monte 
Carlo (MCMC) methods. For a discussion of such methods, see the books [1051 
I124j and references therein. Such methods have been used for parameter estima- 
tion in dynamical systems {e.g., |21j). 

Traditional Monte Carlo or MCMC methods may be used to perform "batch" 
inference, i.e. when all of the observations are available at once and one would 
like to estimate j?(xq~ 1 Ij/q" 1 ) for fixed n, although even in this setting they might 
be prohibitively computationally expensive. When the goal is to perform "on- 
line" or sequential inference, or in an effort to try to reduce the computational 
expense, one might try sequential Monte Carlo methods (SMC) and their many 
variations. A particularly popular version of these methods is known as particle 
filtering. For a well-written, thorough introduction to the principles of sequential 
Monte Carlo (SMC) and particle filtering methods, see the recent tutorial by 
Doucet and Johansen [28J. For an incomplete list of works concerning SMC and 
particle filtering, as well as their adaptations to parameter estimation, see 
[Ml EHl 133 IMl Sll EOl IZ3 [S3 ISOl Q3B3 EE] • The basic idea is that the posterior 
distributions of interest are approximated by a finite collection of N samples, 
called particles, which are recursively propagated through the model. The main 
theoretical advantage of these methods is that one is often able to establish 
the convergence of the approximations to the true posterior distributions as the 
number of particles N tends to infinity. 

6.4-2 ABC methods. Most of the methods mentioned previously in this sec- 
tion rely on explicit knowledge and evaluation of the likelihood function. In 
many situations, such as in high dimensional complex models, the likelihood 
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function may not be available or is computationally expensive to evaluate. In 
such scenarios, a simple computational method called approximate Bayesian 
computation (ABC) offers a powerful alternative to conduct statistical infer- 
ence. ABC was first proposed as a philosophical argument in |107j and intro- 
duced to population genetics in |117| . Since then these methods have become 
extremely popular in many applied fields. A partial list of references include 
[231 E3 EH [961 [Ml EES EH IISl \IM ■ A g°° d review with applications to filter- 
ing is |46| . Briefly speaking, in ABC methods one first draws a parameter value 
9* from the prior distribution and generates synthetic data from the likelihood 
model corresponding to 6* . If the synthetic data "is similar to" the observed data 
(measured in some metric) up to a prespecified tolerance then 8* is accepted as 
a draw from the (approximate) posterior distribution. Choosing the metric and 
the tolerance level are difficult problems, but partial results are known (|35j). 

An important point to note is that in many examples, a summary statistic 
instead of the original data set is used for matching. This clearly results in loss of 
information (and sometimes even results in invalid inference; see |106| ) and thus 
raises the interesting question about when one can perform consistent model 
selection using the ABC methodology. In [78] a sufficient criteria is worked out, 
but clearly more needs to be done especially in the context of dynamical systems. 

7. OPEN QUESTIONS AND FUTURE DIRECTIONS 

Here we list some open questions related to parameter inference in dynamical 
systems and discuss possible future research directions. 

The first question examines if parameter estimation is possible even if denoising 
is impossible. 

Question 7.1. As shown by Lalley and Nobel [71], there are instances in 
which state estimation or denoising in the observational noise setting is impossi- 
ble. Is it possible to exhibit a family of topological dynamical systems (X, T a ) on 
a compact metric space X such that consistent denoising is (provably) impossible 
but consistent parameter estimation is nonetheless (provably) possible? 

The most common method in theory and practice for parameter estimation is 
ML. It is open if ML in the observational noise setting is consistent. A related 
question is can the approach to proving consistency results for HMMs in [27] be 
adapted to the observational noise setting. 

Question 7.2. Recall the definition of the ML estimator of the parameter a 
given in (4.3): 

a n = argmaxmaxL(xo, a), 

a x 

where L(xo,a) is the likelihood of xq and a conditional on the observations 



(yk)k=n- I n * ne observational noise case setting (4.1)-(4.2), what are necessary 



and sufficient conditions on the system such that a n converges to a with proba- 
bility 1 (for almost every initial condition xo with respect to an ergodic measure 
//)? If necessary and sufficient conditions are out of reach given current tools, 
partial answers to this question in the form of general sufficient conditions might 
also be interesting. 
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In order to get finite sample error bounds, one would also like to know about 
the deviations of the MLE from its average. This line of reasoning leads to the 
following question. 



Question 7.3. In the observational noise setting (4.1)-(4.2), under which 
conditions on the system is it true that the MLE is asymptotically normal in the 
observational noise setting? 



Since the method of moments presented in Section 4.4 is currently the only 
example to our knowledge for which consistency of any parameter estimation 
method can be proved in the observational noise setting, it is worth considering 
how it might be generalized. 



Question 7.4. Can the method of moments presented in Section |4.4| for 
the logistic family be generalized? Under what conditions is it applicable and 
consistent? 

In the combined noise setting of Section [6j it is still the case that the issue 
of parameter inference has not been satisfactorily resolved. Certainly any filter- 
ing method may be trivially extended to a parameter estimation algorithm by 
extending the state space to include the parameters, but in such cases the degen- 
eracy of the extended system typically causing the filtering methods to fail. Let 
us paraphrase a question in [28J. 

Question 7.5. Under what conditions on the model are there efficient algo- 
rithms for parameter estimation in the general state space setting? What theo- 
retical guarantees can be given to justify such algorithms? 

The range of applications of statistical inference methods for deterministic 
dynamical systems seems to be increasing rapidly. These systems present signif- 
icant new challenges, since the deterministic systems may have very long-range 
dependency structures. It would be a significant breakthrough if methods could 
be developed that provided asymptotically consistent algorithms for parameter 
estimation; moreover, one would like to have finite-size sample bounds on the 
accuracy of these algorithms. Given the difficulty of dealing with the long-range 
dependencies present in general in the observational noise model, it appears likely 
that the traditional methods of parameter inference may not work particularly 
well in this setting, and therefore new ideas and methods should be developed. 

One possible approach would be to consider a weakened notion of consistency. 
For example, one could consider a parameter estimation method to be consistent 
if it returns a set of plausible parameters that asymptotically contains the true 
parameter. Such weakened notions of consistency might be necessary for providing 
some theoretical justification of parameter estimation algorithms when achieving 
strong consistency appears out of reach. 

Let us close with one recent development in the field of dynamical systems and 
ergodic theory that might be useful in obtaining such rates of convergence. The 



concentration inequalities mentioned at the end of Section 3.3 provide a powerful 
method for obtaining finite sample error bounds for a wide class of statistical 
estimators for a wide class of dynamical systems. One might hope that these 
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concentration inequalities can be used to get rigorous error bounds for parameter 
estimation algorithms. 
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